Linear Regression Models & Associated Diagnostic Plots
Model 1: Regressing Income on All Variables in Our Subset
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -17234 | 20175 | -0.85 | 0.3931 |
| sexMale | 21434 | 1629 | 13.16 | 0.0000 |
| num.times.incarc | -3239 | 1921 | -1.69 | 0.0919 |
| longest.length.incarc.months | -158 | 141 | -1.12 | 0.2636 |
| condition.limiting.workYes | -6593 | 3706 | -1.78 | 0.0754 |
| citizenshipUnknown, not born in the U.S. | 10165 | 4603 | 2.21 | 0.0273 |
| citizenshipUnknown, birthplace unknown | 14665 | 3297 | 4.45 | 0.0000 |
| urban.status.age.12Rural | -3392 | 1889 | -1.80 | 0.0727 |
| residential.dad.highest.grade.completed | 237 | 341 | 0.69 | 0.4875 |
| residential.mom.highest.grade.completed | 1313 | 363 | 3.62 | 0.0003 |
| raceBlack | -3664 | 2328 | -1.57 | 0.1156 |
| raceHispanic | -1645 | 2536 | -0.65 | 0.5166 |
| raceMixed Race (Non Hispanic) | -14102 | 9520 | -1.48 | 0.1386 |
| num.children.under.6 | -626 | 1105 | -0.57 | 0.5709 |
| highest.degree.earnedGED | 2338 | 5101 | 0.46 | 0.6468 |
| highest.degree.earnedHigh School Diploma | 9365 | 4350 | 2.15 | 0.0314 |
| highest.degree.earnedAssociate | 17845 | 5055 | 3.53 | 0.0004 |
| highest.degree.earnedBachelors | 33999 | 4586 | 7.41 | 0.0000 |
| highest.degree.earnedMasters | 36818 | 5275 | 6.98 | 0.0000 |
| highest.degree.earnedPhD | 52522 | 12805 | 4.10 | 0.0000 |
| highest.degree.earnedProfessional Degree (DDS, JD, MD) | 128980 | 7990 | 16.14 | 0.0000 |
| marital.statusMarried | 11246 | 1838 | 6.12 | 0.0000 |
| marital.statusSeparated | 1851 | 7060 | 0.26 | 0.7932 |
| marital.statusDivorced | 4990 | 3025 | 1.65 | 0.0991 |
| marital.statusWidowed | 14510 | 16965 | 0.86 | 0.3925 |
| age.in.2017 | 423 | 572 | 0.74 | 0.4592 |
From this initial linear regression model, we can observe that, holding all other variables constant, men make 21,434 dollars more on average than their female counterparts. This coefficient estimate is statistically significant, with a p-value of 3.6433317 × 10^{-38}.
Both criminal history-related variables, num.times.incarc and longest.length.incarc.months, appear to be negatively associated with income. As the number of times an individual has been incarcerated increases, and as the length of a respondent’s longest period of incarceration increases, their income appears to decline. Neither of these coefficients is statistically significant at a level of 0.05.
Respondents with an unknown birthplace and those who were not born in the U.S. appear to make more money on average, holding all other variables constant, than their counterparts who are U.S. citizens. A critical fact to note, however, is that individuals with an unknown birthplace make up only 794 of the respondents, so this effect could be overstated given the small sample size relative to the number of respondents in the study who are citizens.
With regard to the impact of race on income, Black, Hispanic, and Mixed Race (Non Hispanic) respondents appear to make less, on average, than their Non-Black, Non-Hispanic counterparts. However, none of these differences in earnings is statistically significant.
The education-related variables have the largest number of statistically significant associations with income. As the level of the highest degree that individuals have received increases, their earnings also appear to increase. Study participants with a Professional Degree (DDS, JD, or MD) appear to make, on average, 128,980 dollars more than their counterparts who have not earned a degree. This coefficient estimate is statistically significant, with a p-value that rounds to 0.0000.
Finally, individuals who are married appear to make, on average, 11,246 dollars more than their counterparts who are not married. This difference is statistically significant in this linear regression model, with a p-value of 1.1059217 × 10^{-9}.
From the R-squared value for this linear regression model, only about 30% of the variance in income is explained by the model. To analyze model fit further, we also ran the plot() function to produce the diagnostic plots for this model.
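As a rough illustration of this workflow, the sketch below fits a small linear model, extracts its R-squared, and draws the four default diagnostic plots. The data frame sim and its values are simulated stand-ins, not the survey data, and only two predictors are used; everything here is an assumption made for the example.

```r
# Simulated stand-in data; the report's survey data frame is not reproduced here.
set.seed(1)
n <- 200
sim <- data.frame(
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  num.times.incarc = rpois(n, 0.3)
)
sim$income <- 40000 + 15000 * (sim$sex == "Male") -
  3000 * sim$num.times.incarc + rnorm(n, sd = 20000)

fit <- lm(income ~ sex + num.times.incarc, data = sim)

# Share of the variance in income explained by the model
r2 <- summary(fit)$r.squared

# Four default diagnostic plots in a 2 x 2 grid: residuals vs fitted,
# normal Q-Q, scale-location, and residuals vs leverage
pdf(NULL)                # discard graphics output when run non-interactively
par(mfrow = c(2, 2))
plot(fit)
invisible(dev.off())
```

In an interactive session, dropping the pdf(NULL) and dev.off() lines sends the plots to the screen instead.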




Residual vs Fitted Plot
The residual vs. fitted plot shows that, for the lower fitted values, there is some clustering around 0, but as the fitted values increase, there is a greater deviation away from a residual of 0. This could be caused, in part, by the top-coding of the income variable. Furthermore, as we observed from the earlier tabular and graphical summaries, the majority of respondents make around $50,000, so there are more observations on which to calculate the residuals in this part of the income spectrum. This isn’t the case, however, with the observations in the top income tier of the dataset.
Normal QQ Plot
The normal QQ plot shows a similar trend to the residual vs. fitted plot. For the majority of the curve, the residuals match the diagonal almost perfectly, demonstrating that these residuals appear to be mostly normally distributed. Once we reach a theoretical quantile of about 2, however, the residuals deviate drastically from the diagonal line. Therefore, in the upper tail of the dataset, the assumption of normality does not appear to hold.
Scale-Location Plot
The scale-location plot appears to make a strong case that the assumption of constant variance does not hold for this model. As the fitted values increase, the variance increases steadily. This is also likely to be a side effect of the top-coding of the income variable.
Outliers & The Residuals vs Leverage Plot
The residuals vs leverage plot highlights that there are a few values in the dataset that appear to be outliers, as they lie on or near the Cook’s distance lines. Furthermore, these values have high leverage and high residuals.
Model 2: Addressing the Top-Coding of the Income Variable
The following linear regression is the same as the previous model (referred to as Model #1), where income is regressed on all variables in the subset. The notable exception is that observations with topcoded income are no longer in the subset.
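A minimal sketch of this filtering step, assuming the top-coded incomes all share a single known value. The value 150000, the data frame dat, and the toy incomes are hypothetical; the report does not state the survey's actual top-code value.

```r
# Hypothetical illustration: the top-code value and incomes below are made up.
top.code <- 150000
dat <- data.frame(
  id = 1:6,
  income = c(32000, 58000, 150000, 47000, 150000, 91000)
)

# Keep only respondents whose income falls below the top-code value
dat.no.topcode <- subset(dat, income < top.code)
nrow(dat.no.topcode)
```

The same model formula can then be refit on the filtered data frame.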
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -8198 | 13740 | -0.60 | 0.5508 |
| sexMale | 16267 | 1108 | 14.68 | 0.0000 |
| num.times.incarc | -2866 | 1334 | -2.15 | 0.0318 |
| longest.length.incarc.months | -133 | 92 | -1.44 | 0.1487 |
| condition.limiting.workYes | -8840 | 2544 | -3.47 | 0.0005 |
| condition.limiting.workValid Skip | -32868 | 26134 | -1.26 | 0.2086 |
| condition.limiting.workDon’t Know | -42908 | 26177 | -1.64 | 0.1013 |
| condition.limiting.workRefused to Answer | 18055 | 26267 | 0.69 | 0.4919 |
| citizenshipUnknown, not born in the U.S. | 5216 | 3175 | 1.64 | 0.1006 |
| citizenshipUnknown, birthplace unknown | 5666 | 2288 | 2.48 | 0.0134 |
| urban.status.age.12Rural | -2608 | 1278 | -2.04 | 0.0414 |
| urban.status.age.12Unknown | -1555 | 5629 | -0.28 | 0.7824 |
| residential.dad.highest.grade.completed | 133 | 230 | 0.58 | 0.5625 |
| residential.mom.highest.grade.completed | 502 | 248 | 2.02 | 0.0431 |
| raceBlack | -4167 | 1571 | -2.65 | 0.0080 |
| raceHispanic | 149 | 1702 | 0.09 | 0.9302 |
| raceMixed Race (Non Hispanic) | -9654 | 6395 | -1.51 | 0.1313 |
| num.children.under.6 | -1161 | 756 | -1.54 | 0.1247 |
| highest.degree.earnedGED | 5012 | 3485 | 1.44 | 0.1505 |
| highest.degree.earnedHigh School Diploma | 11329 | 2971 | 3.81 | 0.0001 |
| highest.degree.earnedAssociate | 20901 | 3453 | 6.05 | 0.0000 |
| highest.degree.earnedBachelors | 28663 | 3138 | 9.13 | 0.0000 |
| highest.degree.earnedMasters | 35334 | 3627 | 9.74 | 0.0000 |
| highest.degree.earnedPhD | 59464 | 8849 | 6.72 | 0.0000 |
| highest.degree.earnedProfessional Degree (DDS, JD, MD) | 66353 | 7218 | 9.19 | 0.0000 |
| highest.degree.earnedNon-Interview | 14490 | 5671 | 2.56 | 0.0107 |
| highest.degree.earnedInvalid Skip | 22933 | 7361 | 3.12 | 0.0019 |
| marital.statusMarried | 7730 | 1249 | 6.19 | 0.0000 |
| marital.statusSeparated | 1276 | 4601 | 0.28 | 0.7815 |
| marital.statusDivorced | 3909 | 2026 | 1.93 | 0.0538 |
| marital.statusWidowed | 3354 | 10734 | 0.31 | 0.7547 |
| marital.statusInvalid Skip | 2705 | 6840 | 0.40 | 0.6925 |
| age.in.2017 | 546 | 388 | 1.41 | 0.1595 |
From this linear regression model, we can observe that, holding all other variables constant, men make 16,267 dollars more than their female counterparts. This coefficient estimate is statistically significant, with a p-value of 9.0663611 × 10^{-47}.
The results of this model are very similar to Model #1, with some exceptions. The num.times.incarc variable is still negatively associated with income. In this model, unlike in Model #1 however, this relationship is statistically significant. There are some outliers in the comparison of income and criminal history (those with a history of incarceration and very high income). With the removal of these outliers, the model perhaps is better capturing the impact of criminal history on income.
With regard to the impact of race on income, Black and Mixed Race (Non Hispanic) respondents appear to, on average, make less than their Non-Black, Non-Hispanic counterparts, while the coefficient for Hispanic respondents is now small and positive (149 dollars) and far from statistically significant. In this model, the difference in earnings is statistically significant for Black respondents.
From the R-squared value for this linear regression model, only about 25% of the variance in the data is explained by the model. This R-squared is slightly lower than that of Model #1. To conduct further analysis on model fit, we also ran the plot() function to analyze the diagnostic plots for this model.




Residual vs Fitted Plot
The residual vs. fitted plot shows that, for the lower fitted values, there is some clustering around 0, but as the fitted values increase, there is a greater deviation away from a residual of 0. This plot suggests that the model is still overestimating income in both the lower and the higher income brackets. The deviation in the upper tail of the dataset, however, is not as drastic in Model #2 as it was in Model #1.
Normal QQ Plot
The normal QQ plot shows a similar trend to the residual vs. fitted plot in the first model. Although there are still large deviations in the upper tail, the data in this model appear more normal than in Model #1.
Scale-Location Plot
As the fitted values increase, the variance still increases steadily. The slope of the standardized residuals, although not constant, is smaller than the slope in Model #1.
Outliers & The Residuals vs Leverage Plot
The residuals vs leverage plot shows that the outliers in the first model persist even with the removal of topcoded income.
Overall, it appears that removing the topcoded income values has addressed some of the non-constant variance from Model #1, but not all of it.
Models 3 & 4: Distilling the Impact of Gender & Criminal History on Income
Adding An Interaction Term between sex and num.times.incarc
From the tabular and graphical summaries conducted earlier, it was evident that there was a difference in the incarceration rate of respondents in the study based on their gender. Particularly, more men than women in the study had been incarcerated and, among the population that had been incarcerated, men tended to serve longer sentences than women. With this information, we decided to take a look at the gender differences in income based on criminal history.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -8230 | 13743 | -0.60 | 0.5493 |
| sexMale | 16305 | 1116 | 14.61 | 0.0000 |
| num.times.incarc | -1663 | 4367 | -0.38 | 0.7034 |
| longest.length.incarc.months | -133 | 92 | -1.44 | 0.1511 |
| condition.limiting.workYes | -8851 | 2545 | -3.48 | 0.0005 |
| condition.limiting.workValid Skip | -32837 | 26139 | -1.26 | 0.2092 |
| condition.limiting.workDon’t Know | -42909 | 26182 | -1.64 | 0.1014 |
| condition.limiting.workRefused to Answer | 18083 | 26273 | 0.69 | 0.4914 |
| citizenshipUnknown, not born in the U.S. | 5230 | 3176 | 1.65 | 0.0998 |
| citizenshipUnknown, birthplace unknown | 5665 | 2289 | 2.47 | 0.0134 |
| urban.status.age.12Rural | -2615 | 1278 | -2.05 | 0.0409 |
| urban.status.age.12Unknown | -1545 | 5630 | -0.27 | 0.7838 |
| residential.dad.highest.grade.completed | 133 | 231 | 0.58 | 0.5651 |
| residential.mom.highest.grade.completed | 503 | 248 | 2.03 | 0.0427 |
| raceBlack | -4157 | 1571 | -2.65 | 0.0082 |
| raceHispanic | 148 | 1703 | 0.09 | 0.9307 |
| raceMixed Race (Non Hispanic) | -9654 | 6396 | -1.51 | 0.1313 |
| num.children.under.6 | -1153 | 757 | -1.52 | 0.1276 |
| highest.degree.earnedGED | 5009 | 3486 | 1.44 | 0.1509 |
| highest.degree.earnedHigh School Diploma | 11314 | 2972 | 3.81 | 0.0001 |
| highest.degree.earnedAssociate | 20896 | 3453 | 6.05 | 0.0000 |
| highest.degree.earnedBachelors | 28654 | 3139 | 9.13 | 0.0000 |
| highest.degree.earnedMasters | 35323 | 3627 | 9.74 | 0.0000 |
| highest.degree.earnedPhD | 59462 | 8850 | 6.72 | 0.0000 |
| highest.degree.earnedProfessional Degree (DDS, JD, MD) | 66356 | 7220 | 9.19 | 0.0000 |
| highest.degree.earnedNon-Interview | 14541 | 5675 | 2.56 | 0.0105 |
| highest.degree.earnedInvalid Skip | 22922 | 7363 | 3.11 | 0.0019 |
| marital.statusMarried | 7734 | 1249 | 6.19 | 0.0000 |
| marital.statusSeparated | 1285 | 4602 | 0.28 | 0.7801 |
| marital.statusDivorced | 3896 | 2026 | 1.92 | 0.0546 |
| marital.statusWidowed | 3372 | 10737 | 0.31 | 0.7535 |
| marital.statusInvalid Skip | 2728 | 6841 | 0.40 | 0.6902 |
| age.in.2017 | 547 | 388 | 1.41 | 0.1593 |
| sexMale:num.times.incarc | -1295 | 4476 | -0.29 | 0.7724 |
From the results of adding this interaction term, we can observe that the estimated association between incarceration and income differs by the study participants’ gender. Specifically, we see that for women, each additional incarceration is associated with a 1,663 dollar decline in income. For men, each additional incarceration is associated with a 2,958 dollar decline in income (the sum of the num.times.incarc coefficient and the interaction coefficient). Although these findings are in line with our tabular and graphical summaries for num.times.incarc, neither one of these findings is statistically significant.
In this model, we still observe that sex, citizenship, the highest grade completed by the residential mother, the highest degree earned, and the married marital status have statistically significant associations with income. The highest grade completed by the residential father is not statistically significant.
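The two slopes quoted above follow directly from the coefficient table: women are the reference group for sexMale, so their slope is the num.times.incarc coefficient, and the slope for men adds the interaction coefficient. A short sketch of that arithmetic, with variable names chosen only for illustration:

```r
# Coefficients as reported in the table for this model
b.incarc      <- -1663   # num.times.incarc: slope for women (reference group)
b.interaction <- -1295   # sexMale:num.times.incarc

slope.women <- b.incarc
slope.men   <- b.incarc + b.interaction

c(women = slope.women, men = slope.men)
```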
Testing the Significance of Adding the Sex * Number of Times Incarcerated Term
Now that the sex * num.times.incarc interaction variable was added to the regression, we also decided to test whether this addition was statistically significant using an ANOVA test.
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017 + sex:num.times.incarc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2346 1595000619595
## 2 2345 1594943701328 1 56918266 0.0837 0.7724
From this analysis, we can see that because the p-value for this test is 0.7723895, we would fail to reject the null hypothesis that the coefficient on the sex * num.times.incarc interaction is zero. Put simply, the data do not provide evidence that the income gap between men and women varies based on the number of times an individual has been incarcerated.
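The F statistic and p-value in the ANOVA table can be recomputed by hand from the residual sums of squares and residual degrees of freedom it reports. The sketch below does so with the standard nested-model F formula; the variable names are illustrative only.

```r
# Values as printed in the ANOVA table above
rss.reduced <- 1595000619595   # Model 1: no interaction term
rss.full    <- 1594943701328   # Model 2: adds sex:num.times.incarc
df.reduced  <- 2346            # residual degrees of freedom, Model 1
df.full     <- 2345            # residual degrees of freedom, Model 2

# F = (drop in RSS per added parameter) / (residual variance of the full model)
df.diff <- df.reduced - df.full
f.stat  <- ((rss.reduced - rss.full) / df.diff) / (rss.full / df.full)
p.value <- pf(f.stat, df.diff, df.full, lower.tail = FALSE)

round(c(F = f.stat, p = p.value), 4)
```

Because only one parameter is added, this F-test is equivalent to the t-test on the interaction coefficient, which is why both report the same p-value of 0.7724.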
Adding An Interaction Term of Sex * Longest.Length.Incarc.Months
We also saw from the initial tabular and graphical summaries that men, on average, served longer sentences when they were incarcerated than women. Therefore, we also wanted to add an interaction term between sex and longest.length.incarc.months to see if there were any differences in income by sex based on the longest length of time that an individual had been incarcerated.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -8564 | 14115 | -0.61 | 0.5441 |
| sexMale | 16599 | 3034 | 5.47 | 0.0000 |
| num.times.incarc | -1210 | 6160 | -0.20 | 0.8443 |
| longest.length.incarc.months | -205 | 702 | -0.29 | 0.7702 |
| condition.limiting.workYes | -8854 | 2546 | -3.48 | 0.0005 |
| condition.limiting.workValid Skip | -32831 | 26145 | -1.26 | 0.2093 |
| condition.limiting.workDon’t Know | -42907 | 26188 | -1.64 | 0.1015 |
| condition.limiting.workRefused to Answer | 18078 | 26278 | 0.69 | 0.4915 |
| citizenshipUnknown, not born in the U.S. | 5231 | 3177 | 1.65 | 0.0998 |
| citizenshipUnknown, birthplace unknown | 5667 | 2289 | 2.48 | 0.0134 |
| urban.status.age.12Rural | -2611 | 1279 | -2.04 | 0.0414 |
| urban.status.age.12Unknown | -1542 | 5631 | -0.27 | 0.7843 |
| residential.dad.highest.grade.completed | 133 | 231 | 0.58 | 0.5646 |
| residential.mom.highest.grade.completed | 503 | 248 | 2.03 | 0.0426 |
| raceBlack | -4161 | 1572 | -2.65 | 0.0082 |
| raceHispanic | 147 | 1703 | 0.09 | 0.9312 |
| raceMixed Race (Non Hispanic) | -9656 | 6398 | -1.51 | 0.1313 |
| num.children.under.6 | -1153 | 757 | -1.52 | 0.1278 |
| highest.degree.earnedGED | 5027 | 3491 | 1.44 | 0.1500 |
| highest.degree.earnedHigh School Diploma | 11312 | 2973 | 3.81 | 0.0001 |
| highest.degree.earnedAssociate | 20897 | 3454 | 6.05 | 0.0000 |
| highest.degree.earnedBachelors | 28653 | 3139 | 9.13 | 0.0000 |
| highest.degree.earnedMasters | 35321 | 3628 | 9.74 | 0.0000 |
| highest.degree.earnedPhD | 59460 | 8852 | 6.72 | 0.0000 |
| highest.degree.earnedProfessional Degree (DDS, JD, MD) | 66354 | 7221 | 9.19 | 0.0000 |
| highest.degree.earnedNon-Interview | 14541 | 5676 | 2.56 | 0.0105 |
| highest.degree.earnedInvalid Skip | 22923 | 7364 | 3.11 | 0.0019 |
| marital.statusMarried | 7734 | 1249 | 6.19 | 0.0000 |
| marital.statusSeparated | 1283 | 4603 | 0.28 | 0.7804 |
| marital.statusDivorced | 3903 | 2028 | 1.92 | 0.0544 |
| marital.statusWidowed | 3373 | 10739 | 0.31 | 0.7535 |
| marital.statusInvalid Skip | 2727 | 6843 | 0.40 | 0.6902 |
| age.in.2017 | 548 | 389 | 1.41 | 0.1587 |
| sexMale:num.times.incarc | -1759 | 6314 | -0.28 | 0.7806 |
| sexMale:longest.length.incarc.months | 74 | 708 | 0.10 | 0.9170 |
The results from this regression model indicate that for each additional month of a woman’s longest period of incarceration, her income decreases by 205 dollars. For men, the corresponding decrease is 131 dollars (the -205 main effect plus the 74 dollar interaction coefficient). This suggests that longest.length.incarc.months depresses the income of women to a slightly higher degree than it does for men. However, neither the interaction term with longest.length.incarc.months nor the variable itself is statistically significant in this model.
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017 + sex:num.times.incarc
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017 + sex:num.times.incarc + sex:longest.length.incarc.months
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2345 1594943701328
## 2 2344 1594936307897 1 7393431 0.0109 0.917
An ANOVA test also shows that the addition of this interaction term did not have a statistically significant impact on explaining income gaps observed in the model.
Model 5: Distilling the Impact of Gender & Race on Income
Is Race Statistically Significant?
Based on the tabular summaries and graphs we saw above, race seems to have a considerable effect on income, but we cannot be certain of this effect until we run an ANOVA test.
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2259 3204242937410
## 2 2256 3197889077219 3 6353860191 1.4941 0.2142
When we run the ANOVA test we see the p-value of 0.2141735 is large enough that race cannot be assessed to be a statistically significant predictor of income when income is topcoded. But that doesn’t tell the full story.
Viewing the Effects of Top-Coding on the Interaction between Race and Income
Earlier, we created a second linear model in which the observations with topcoded income were removed, but how much was that top-coding interacting with our race variable? We ran an ANOVA to test whether race remains statistically insignificant once the topcoded observations are removed.
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2259 3204242937410
## 2 2256 3197889077219 3 6353860191 1.4941 0.2142
Now when we run the ANOVA test we see the p-value of 0.0233493 is sufficiently small for us to say that race actually is a statistically significant predictor of income, even when taken with the rest of our model. This shows that top-coding can have significant effects in changing which variables are taken into consideration.
We see clues of this when we look at our second model reviewing the effects of top-coding income. In our topcoded dataset the coefficient for Black is not statistically significant, but when we remove the topcoded observations this is no longer true, and the coefficient for Black meets the threshold for statistical significance at a significance level of 0.05 with a p-value of 0.0080344. This is similar for other coefficients in the dataset that depress income, which we will expand on in our analysis.
For this reason we’ll use our linear model without top-coding to test how race affects income.
Adding an interaction term between sex and race
The plots and tabular summaries above indicated that race had a significant impact on income, and that variance was not evenly distributed between the sexes. It is clear that Black men and women make less than Non-Black/Non Hispanic men and women. To explore this further we decided to dig into gender differences in income based on racial background.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -9817 | 13769 | -0.71 | 0.4759 |
| sexMale | 17486 | 1375 | 12.72 | 0.0000 |
| num.times.incarc | -2907 | 1333 | -2.18 | 0.0293 |
| longest.length.incarc.months | -115 | 93 | -1.24 | 0.2143 |
| condition.limiting.workYes | -8994 | 2543 | -3.54 | 0.0004 |
| condition.limiting.workValid Skip | -32231 | 26119 | -1.23 | 0.2173 |
| condition.limiting.workDon’t Know | -43268 | 26177 | -1.65 | 0.0985 |
| condition.limiting.workRefused to Answer | 15288 | 26274 | 0.58 | 0.5607 |
| citizenshipUnknown, not born in the U.S. | 5212 | 3174 | 1.64 | 0.1006 |
| citizenshipUnknown, birthplace unknown | 5811 | 2289 | 2.54 | 0.0112 |
| urban.status.age.12Rural | -2400 | 1282 | -1.87 | 0.0613 |
| urban.status.age.12Unknown | -1524 | 5625 | -0.27 | 0.7865 |
| residential.dad.highest.grade.completed | 121 | 230 | 0.52 | 0.6007 |
| residential.mom.highest.grade.completed | 509 | 248 | 2.05 | 0.0402 |
| raceBlack | -708 | 2160 | -0.33 | 0.7433 |
| raceHispanic | 253 | 2258 | 0.11 | 0.9109 |
| raceMixed Race (Non Hispanic) | -1134 | 11725 | -0.10 | 0.9230 |
| num.children.under.6 | -1308 | 758 | -1.73 | 0.0844 |
| highest.degree.earnedGED | 4741 | 3486 | 1.36 | 0.1739 |
| highest.degree.earnedHigh School Diploma | 11194 | 2970 | 3.77 | 0.0002 |
| highest.degree.earnedAssociate | 20781 | 3451 | 6.02 | 0.0000 |
| highest.degree.earnedBachelors | 28508 | 3137 | 9.09 | 0.0000 |
| highest.degree.earnedMasters | 35162 | 3626 | 9.70 | 0.0000 |
| highest.degree.earnedPhD | 59267 | 8845 | 6.70 | 0.0000 |
| highest.degree.earnedProfessional Degree (DDS, JD, MD) | 66352 | 7217 | 9.19 | 0.0000 |
| highest.degree.earnedNon-Interview | 13728 | 5677 | 2.42 | 0.0157 |
| highest.degree.earnedInvalid Skip | 22896 | 7356 | 3.11 | 0.0019 |
| marital.statusMarried | 7981 | 1252 | 6.38 | 0.0000 |
| marital.statusSeparated | 1355 | 4598 | 0.29 | 0.7682 |
| marital.statusDivorced | 4207 | 2028 | 2.07 | 0.0381 |
| marital.statusWidowed | 4515 | 10739 | 0.42 | 0.6742 |
| marital.statusInvalid Skip | 2651 | 6835 | 0.39 | 0.6982 |
| age.in.2017 | 578 | 388 | 1.49 | 0.1370 |
| sexMale:raceBlack | -7177 | 3064 | -2.34 | 0.0192 |
| sexMale:raceHispanic | -144 | 2768 | -0.05 | 0.9584 |
| sexMale:raceMixed Race (Non Hispanic) | -12211 | 13960 | -0.87 | 0.3818 |
The summary of this regression model with the addition of an interaction term between race and sex offers some very interesting results that ripple across all aspects of our analysis.
How Large is the Income Gap Between Different Racial Groups?
Adding the sex-race interaction offers some interesting insights into how race and sex affect income, and how income gaps vary across different racial groups.
Mixed Race, Non-Hispanic men make 13,345 dollars less than their Non-Black, Non-Hispanic male counterparts. This is the largest income differential that we observed by analyzing the income gap by sex and race.
The biggest gap in income between women based on race also exists between Mixed Race, Non-Hispanic women and Non-Black, Non-Hispanic women. Mixed Race, Non-Hispanic women make 1,134 dollars less than their Non-Black, Non-Hispanic counterparts.
Neither of these findings, however, is statistically significant at a significance level of 0.05.
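The race gaps within each sex can be read off the coefficient table: for women (the reference sex) the gap is the race main effect, and for men it is the race main effect plus the matching sexMale:race interaction. A brief sketch of that calculation, with illustrative variable names:

```r
# Coefficients as reported in the sex * race model table
b.race <- c(
  "Black"                     = -708,
  "Hispanic"                  = 253,
  "Mixed Race (Non Hispanic)" = -1134
)
b.male.by.race <- c(
  "Black"                     = -7177,
  "Hispanic"                  = -144,
  "Mixed Race (Non Hispanic)" = -12211
)

# Gap relative to Non-Black/Non-Hispanic respondents of the same sex
gap.women <- b.race
gap.men   <- b.race + b.male.by.race

rbind(women = gap.women, men = gap.men)
```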
How Large is the Income Gender Gap by Racial Group?
The income gap between men and women is similar in size for Hispanic respondents (17,342 dollars) and for Non-Black/Non-Hispanic respondents (17,486 dollars). These are the largest gender income gaps, but the interaction coefficient for Hispanic men is not statistically significant.
The income gap shrinks considerably when looking at Black men and women, with a gap of 10,309 dollars.
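The gender gap for each racial group is the sexMale coefficient plus that group's sexMale:race interaction coefficient, with Non-Black/Non-Hispanic respondents as the reference group. A small sketch reproducing the figures above from the coefficient table (variable names are illustrative):

```r
# Coefficients as reported in the sex * race model table
b.male <- 17486
b.male.by.race <- c(
  "Non-Black/Non-Hispanic"    = 0,        # reference group: no interaction term
  "Black"                     = -7177,
  "Hispanic"                  = -144,
  "Mixed Race (Non Hispanic)" = -12211
)

# Estimated male-minus-female income gap within each racial group
gender.gap <- b.male + b.male.by.race
gender.gap
```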
Other Effects
Another effect of adding an interaction variable between sex and race is that being divorced, which was on the edge of statistical significance in the original model with a p-value of 0.0537522, would now be considered statistically significant with a p-value of 0.038099 when our sex * race interaction variable is added. This makes sense as it follows the general trend showing how a few incomes at the top levels were skewing the results across multiple other variables when they were suppressed by a topcoded value.




Residual vs Fitted Plot
The residual vs. fitted plot continues to show that, for the lower fitted values, there is clustering around 0 and, as the fitted values increase, deviation away from a residual of 0 increases. This suggests that this model is still overestimating income in lower income brackets, very slightly underestimating income in the middle brackets, and overestimating income in higher income brackets. The deviation in the upper tail is mostly similar to Model #2 but corrects slightly for the overestimation of income in the higher brackets.
Normal QQ Plot
The normal QQ plot shows a trend very similar to the one in Model #2, as we might expect from looking at the residual vs. fitted plot. While there are still large deviations in the upper tail, this model corrects very slightly for incomes in higher brackets.
Scale-Location Plot
As the fitted values increase, the variance still increases steadily, holding nearly identically with Model #2 but with some slight variation at the very top and bottom of the range.
Outliers & The Residuals vs Leverage Plot
The residuals vs leverage plot shows that the outliers in Model #2 remain present, and that some additional outliers appear as income rises.
Conclusions from the Plot
Adding the sex*race interaction term appears to change very little beyond what removing the topcoded incomes already changed. While it addresses some minor issues with the model, it also appears to introduce new outliers, making it potentially redundant.
How does adding the sex/race interaction variable change predictions for the Income Gender Gap by Race?
Now that we’ve reviewed the analysis of the gender income gap we wanted to see exactly how large this gender based income gap was between different racial groups in the main effects model versus our model adding a sex*race interaction term.
| Racial group | Gender income gap, main effects model (dollars) |
|---|---|
| Non-Black/Non-Hispanic | 13177 |
| Black | 5356 |
| Hispanic | 13563 |
| Mixed Race (Non Hispanic) | 14836 |
It appears that the main effects model is effective in showing general trends but misses the mark when predicting values for the size of the income gap by sex. At 10,309 dollars, the income gap between Black men and women is nearly double the 5,356 dollars predicted by the main effects model. The main effects model does capture the similarity in the income gap between Hispanic men and women and Non-Black/Non-Hispanic men and women, but it underestimates the size of that gap by about 4,000 dollars.
The main effects model is most inaccurate when considering Mixed Race, Non-Hispanic respondents. It predicts a 14,836 dollar gap which is far off the 5,275 dollar gap shown by the model. We suspect this is due to the small sample size present of Mixed Race, Non-Hispanic respondents as compared to other groups.
Testing the Significance of Adding an Interaction Variable on the Sex * Race Term
In order to determine how we should proceed with the insights gleaned from adding a sex * race interaction variable, we tested the statistical significance of our findings with an ANOVA test.
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017 + sex:race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2346 1595000619595
## 2 2343 1590698444288 3 4302175307 2.1123 0.09663 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reviewing our results, we see that while the p-value of 0.096629 meets what we wouldlook for when assessing for 90% confidence of statistical significance, it falls short of the threshold required to reject the null hypothesis. Therefore we fail to reject the null hypothesis that the income gap is the same across all racial categories at a significance level of 0.05.
Model 6: Distilling the Impact of Marital Status by Sex on Income
Is the Effect of Marital Status Statistically Significant?:
In Model #1 and Model #2 we see that being married has a statistically significant effect on income. Notably, though, being divorced does not have a significant effect in Model #1 and falls just outside the range of statistical significance in Model #2. In Model #3 divorce also becomes statistically significant. To see whether marital status as a whole is a statistically significant variable in our model, we ran an ANOVA test.
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2351 1621581006295
## 2 2346 1595000619595 5 26580386700 7.8191 0.0000002608 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we run the ANOVA test we see the p-value of 0.0000003 is sufficiently small for us to say that marital status is a statistically significant predictor of income, and is influencing our model.
Adding an interaction term between sex and marital status
Our tabular and graphical summaries above illustrated clearly that marital status is not uniform between men and women, with more women being married and divorced. The ANOVA test above additionally tells us that marital status has a statistically significant effect on income. With these facts in mind, we created an interaction variable for sex * marital.status to see how marital status interacts with sex to influence our model.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -4288 | 13663 | -0.31 | 0.7536 |
| sexMale | 7580 | 1884 | 4.02 | 0.0001 |
| num.times.incarc | -2568 | 1327 | -1.94 | 0.0530 |
| longest.length.incarc.months | -117 | 92 | -1.28 | 0.2002 |
| condition.limiting.workYes | -8927 | 2528 | -3.53 | 0.0004 |
| condition.limiting.workValid Skip | -30238 | 25942 | -1.17 | 0.2439 |
| condition.limiting.workDon’t Know | -39026 | 25989 | -1.50 | 0.1333 |
| condition.limiting.workRefused to Answer | 19920 | 26119 | 0.76 | 0.4457 |
| citizenshipUnknown, not born in the U.S. | 5288 | 3152 | 1.68 | 0.0936 |
| citizenshipUnknown, birthplace unknown | 6019 | 2275 | 2.65 | 0.0082 |
| urban.status.age.12Rural | -2386 | 1269 | -1.88 | 0.0601 |
| urban.status.age.12Unknown | -1231 | 5591 | -0.22 | 0.8257 |
| residential.dad.highest.grade.completed | 92 | 229 | 0.40 | 0.6877 |
| residential.mom.highest.grade.completed | 589 | 247 | 2.39 | 0.0169 |
| raceBlack | -5132 | 1567 | -3.27 | 0.0011 |
| raceHispanic | 242 | 1691 | 0.14 | 0.8863 |
| raceMixed Race (Non Hispanic) | -9914 | 6351 | -1.56 | 0.1186 |
| num.children.under.6 | -1277 | 752 | -1.70 | 0.0894 |
| highest.degree.earnedGED | 5418 | 3462 | 1.57 | 0.1177 |
| highest.degree.earnedHigh School Diploma | 11556 | 2949 | 3.92 | 0.0001 |
| highest.degree.earnedAssociate | 21354 | 3429 | 6.23 | 0.0000 |
| highest.degree.earnedBachelors | 28629 | 3116 | 9.19 | 0.0000 |
| highest.degree.earnedMasters | 35075 | 3602 | 9.74 | 0.0000 |
| highest.degree.earnedPhD | 59911 | 8785 | 6.82 | 0.0000 |
| highest.degree.earnedProfessional Degree (DDS, JD, MD) | 66555 | 7166 | 9.29 | 0.0000 |
| highest.degree.earnedNon-Interview | 14157 | 5634 | 2.51 | 0.0121 |
| highest.degree.earnedInvalid Skip | 23532 | 7307 | 3.22 | 0.0013 |
| marital.statusMarried | 296 | 1773 | 0.17 | 0.8673 |
| marital.statusSeparated | 3119 | 6634 | 0.47 | 0.6383 |
| marital.statusDivorced | -1142 | 2762 | -0.41 | 0.6793 |
| marital.statusWidowed | -10762 | 11714 | -0.92 | 0.3583 |
| marital.statusInvalid Skip | -11996 | 11683 | -1.03 | 0.3046 |
| age.in.2017 | 553 | 386 | 1.44 | 0.1513 |
| sexMale:marital.statusMarried | 13727 | 2354 | 5.83 | 0.0000 |
| sexMale:marital.statusSeparated | -3736 | 9123 | -0.41 | 0.6822 |
| sexMale:marital.statusDivorced | 9161 | 3990 | 2.30 | 0.0218 |
| sexMale:marital.statusWidowed | 64308 | 28500 | 2.26 | 0.0241 |
| sexMale:marital.statusInvalid Skip | 23355 | 14314 | 1.63 | 0.1029 |
How Large is the Income Gap Between Marital Statuses?
From the summary we see a number of very interesting results.
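We see that being married has a strong positive association with income. Combining the main effects with the interaction terms, married men make 6,004 dollars more on average than divorced men and considerably more, 14,640 dollars, than separated men. The sex-by-marital-status interaction terms alone differ by 4,566 and 17,463 dollars respectively, which measures how much wider the gender gap is among married respondents than among divorced or separated respondents.
This positive association with income also holds for married women, who on average make 1,438 dollars more than their divorced counterparts.
However, being separated has a far more positive association with income for women than for men, with separated women making on average 2,823 dollars more than married women.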
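Comparisons between marital statuses within one sex follow from the coefficient table: for women (the reference sex) only the marital status main effect applies, while for men the main effect and the sexMale interaction term add together. The sketch below, in Python, works through that arithmetic; the helper names `effect` and `contrast` are illustrative and not part of the original analysis.

```python
# Model 6 coefficients copied from the table above
# (reference group: never-married women).
main = {"Married": 296, "Separated": 3119, "Divorced": -1142}
interaction = {"Married": 13727, "Separated": -3736, "Divorced": 9161}

def effect(status, male):
    """Shift in predicted income relative to a never-married person of the same sex."""
    return main[status] + (interaction[status] if male else 0)

def contrast(a, b, male):
    """Predicted income difference between statuses a and b within one sex."""
    return effect(a, male) - effect(b, male)

women_married_vs_divorced = contrast("Married", "Divorced", male=False)
women_separated_vs_married = contrast("Separated", "Married", male=False)
men_married_vs_divorced = contrast("Married", "Divorced", male=True)
men_married_vs_separated = contrast("Married", "Separated", male=True)

# Differences in the interaction terms alone: how much wider the
# male-female gap is for married respondents than for the other status.
gap_change_married_vs_divorced = interaction["Married"] - interaction["Divorced"]
gap_change_married_vs_separated = interaction["Married"] - interaction["Separated"]

print(women_married_vs_divorced, women_separated_vs_married)
print(men_married_vs_divorced, men_married_vs_separated)
print(gap_change_married_vs_divorced, gap_change_married_vs_separated)
```

Keeping the full contrast separate from the interaction-only difference makes clear which quantity a given dollar figure refers to.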
How Large is the Income Gender Gap between Different Marital Statuses?
We also see some interesting results regarding the gender differential within each marital status. Most strikingly, we see a 71,888 dollar income gap between men who have been widowed and women who have been widowed. The interaction term is statistically significant at the 0.05 level (p = 0.0241), suggesting that losing a spouse has a far greater impact on women's income than on men's.
It is important to note that the sample size for women and men who were widowed is very small. In fact, only 4 men and 19 women in the entire dataset have been widowed. These small sample sizes likely explain at least some of the irregularity of the result for widowed respondents compared to other marital status groups.
The widowed gap is far larger than the 21,307 dollar difference between married men and women, the group that appears most frequently in our data; that difference is also statistically significant.
Both far exceed the 3,844 dollar income gap between men and women who are classified as separated. While the income gap persists across all marital statuses, its relative size depends heavily on the specific marital status.
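The gender gap within each marital status is the sexMale coefficient plus the matching interaction coefficient. A short Python sketch of that calculation, using the Model 6 estimates from the table above (the variable names are our own):

```python
# Model 6 estimates copied from the coefficient table above.
sex_male = 7580
interaction = {
    "Never-married": 0,   # reference level, no interaction term
    "Married": 13727,
    "Separated": -3736,
    "Divorced": 9161,
    "Widowed": 64308,
}

# Predicted male-minus-female income gap, holding other variables constant.
gender_gap = {status: sex_male + coef for status, coef in interaction.items()}

for status, gap in gender_gap.items():
    print(f"{status}: {gap}")
```

For never-married respondents the gap is the sexMale coefficient alone, because never-married is the reference level.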




Residual vs Fitted Plot
The residual vs. fitted plot continues to show that, for the lower fitted values, there is clustering around 0 and as the fitted values increase, deviation away from a residual of 0 increases, suggesting that this model is still overestimating income in lower income brackets. However, adding the interaction variable between sex and marital status does seem to correct much of the overestimation of income in the higher brackets.
Normal QQ Plot
The normal QQ plot shows a trend broadly similar to the corresponding plots for Models 2 and 3. As the residual vs. fitted plot suggests, there is a measurable correction for incomes in the upper tail of the range, though significant deviation remains.
Scale-Location Plot
As the fitted values increase, the variance still increases steadily like Model #2 and #3 but the increase is sharper at the bottom of the range and slightly flatter at the top.
Outliers & The Residuals vs Leverage Plot
The residuals vs leverage plot shows that many of the outliers in Model #2 remain present, but the model corrects for some of the outliers toward the top of the income range with standardized residuals below 0.
Conclusions from the Plot
Adding the sex * marital.status interaction term appears to make small but meaningful changes at the higher end of the income spectrum that removing top coding from income didn’t address. It did not solve some of the issues with the model at the lower end of the range, but the better fit at the top of the range suggests that the addition of the interaction term improves the model in a meaningful way.
Predicted Income Gender Gap between Different Marital Statuses with and without the Interaction Variable
| Marital status | Predicted income gap (dollars) |
|---|---|
| Never-married | 5804 |
| Married | 17652 |
| Separated | -4974 |
| Divorced | 10690 |
| Widowed | 31667 |
| Non-Interview | NaN |
| Invalid Skip | 12499 |
We see from the predicted income gaps under the main effects model that the overall pattern is broadly the same, but the interaction model captures a much larger gap between widowed men and women (71,888 versus 31,667 dollars) and offers a clearer picture of the gap between married men and women (21,307 versus 17,652 dollars).
Considering the way in which the marital * sex interaction term improved the fit of the topcoded model (Model #2) at the top of the income range, the very high value of the widowed interaction coefficient for men, at 64,308 dollars, suggests that a small number of high earners who have been widowed were potentially skewing the results of the model slightly.
The main effects model does, however, capture the positive effect that separation has on women's income and the negative effect it has on men's, producing one of the only negative values for men's income we have seen in our analysis.
Testing the Significance of Adding an Interaction Variable on the Sex * Marital Status Term
To determine whether the addition of the sex * marital.status interaction term to the model was statistically significant, we ran an ANOVA test comparing our two models.
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017 + sex:marital.status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2346 1595000619595
## 2 2341 1567735263642 5 27265355953 8.1427 0.0000001246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the p-value of 0.0000001 meets the threshold required to reject the null hypothesis. Therefore we can reject the null hypothesis that the income gap between men and women is the same across all marital statuses and instead conclude that marital status has a statistically significant effect on the income gap between the sexes.
Model 7: Distilling the Impact of Education on Income Across Sexes
Is the effect of greater education statistically significant?
The most obvious factor for variance in income between individuals both intuitively and from the tabular summaries and plots above is education. It appears from looking at other models that the difference in income scales upward as you receive more education, so we decided to look further into the highest degree held variable to better understand the effect education has on income for the different sexes. The first question we asked is if education is statistically significant and if so, how significant is it?
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + marital.status + age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2355 1810482639239
## 2 2346 1595000619595 9 215482019644 35.216 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we run the ANOVA test we see the p-value of 7.9394871 × 10^{-59} is sufficiently small for us to say that highest degree earned is a highly statistically significant predictor of income, and is strongly influencing our model.
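One rough way to gauge how much a term matters, beyond its p-value, is the share of the reduced model's residual sum of squares that the term removes. The Python sketch below applies this to the RSS values reported in the education and marital status ANOVA tables in this section; the function name is our own, and this measure is an illustration rather than something computed in the original analysis.

```python
def rss_share_explained(rss_without, rss_with):
    """Fraction of the reduced model's residual sum of squares removed by the added term."""
    return (rss_without - rss_with) / rss_without

# RSS values copied from the ANOVA tables for each variable.
education_share = rss_share_explained(1810482639239, 1595000619595)
marital_share = rss_share_explained(1621581006295, 1595000619595)

print(f"highest.degree.earned: {education_share:.1%}")
print(f"marital.status: {marital_share:.1%}")
```

Under this measure, adding highest degree earned removes a much larger share of the unexplained variation than adding marital status does, which is consistent with the much larger F statistic.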
Adding an Interaction Term Between Sex and Highest Degree Earned
The tabular summaries and plots above indicate wide variation in the educational backgrounds of respondents. While it is clear that education significantly affects income, as the ANOVA test confirmed, the nuances of how specific educational backgrounds affect income are less clearly discernible from the data. To better understand this we added an interaction variable for sex * highest.degree.earned to see how educational background interacts with sex to influence our model.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -10041 | 14217 | -0.71 | 0.4801 |
| sexMale | 19734 | 5851 | 3.37 | 0.0008 |
| num.times.incarc | -2811 | 1345 | -2.09 | 0.0367 |
| longest.length.incarc.months | -133 | 92 | -1.44 | 0.1499 |
| condition.limiting.workYes | -8955 | 2549 | -3.51 | 0.0005 |
| condition.limiting.workValid Skip | -32121 | 26169 | -1.23 | 0.2198 |
| condition.limiting.workDon’t Know | -43278 | 26206 | -1.65 | 0.0988 |
| condition.limiting.workRefused to Answer | 16295 | 26351 | 0.62 | 0.5364 |
| citizenshipUnknown, not born in the U.S. | 5609 | 3189 | 1.76 | 0.0787 |
| citizenshipUnknown, birthplace unknown | 5968 | 2300 | 2.59 | 0.0095 |
| urban.status.age.12Rural | -2620 | 1280 | -2.05 | 0.0408 |
| urban.status.age.12Unknown | -1574 | 5645 | -0.28 | 0.7803 |
| residential.dad.highest.grade.completed | 144 | 231 | 0.62 | 0.5329 |
| residential.mom.highest.grade.completed | 501 | 248 | 2.02 | 0.0438 |
| raceBlack | -4317 | 1575 | -2.74 | 0.0062 |
| raceHispanic | -49 | 1709 | -0.03 | 0.9772 |
| raceMixed Race (Non Hispanic) | -9356 | 6408 | -1.46 | 0.1444 |
| num.children.under.6 | -1126 | 759 | -1.48 | 0.1381 |
| highest.degree.earnedGED | 10487 | 5736 | 1.83 | 0.0676 |
| highest.degree.earnedHigh School Diploma | 12772 | 4830 | 2.64 | 0.0082 |
| highest.degree.earnedAssociate | 23703 | 5422 | 4.37 | 0.0000 |
| highest.degree.earnedBachelors | 31045 | 4908 | 6.33 | 0.0000 |
| highest.degree.earnedMasters | 39375 | 5465 | 7.20 | 0.0000 |
| highest.degree.earnedPhD | 59078 | 10989 | 5.38 | 0.0000 |
| highest.degree.earnedProfessional Degree (DDS, JD, MD) | 70255 | 9551 | 7.36 | 0.0000 |
| highest.degree.earnedNon-Interview | 14833 | 7246 | 2.05 | 0.0408 |
| highest.degree.earnedInvalid Skip | 27112 | 10935 | 2.48 | 0.0132 |
| marital.statusMarried | 7760 | 1251 | 6.21 | 0.0000 |
| marital.statusSeparated | 1730 | 4617 | 0.37 | 0.7079 |
| marital.statusDivorced | 3981 | 2031 | 1.96 | 0.0500 |
| marital.statusWidowed | 3536 | 10767 | 0.33 | 0.7426 |
| marital.statusInvalid Skip | 2729 | 6851 | 0.40 | 0.6904 |
| age.in.2017 | 531 | 389 | 1.36 | 0.1725 |
| sexMale:highest.degree.earnedGED | -8668 | 7207 | -1.20 | 0.2292 |
| sexMale:highest.degree.earnedHigh School Diploma | -2116 | 6083 | -0.35 | 0.7280 |
| sexMale:highest.degree.earnedAssociate | -4614 | 6995 | -0.66 | 0.5096 |
| sexMale:highest.degree.earnedBachelors | -3881 | 6191 | -0.63 | 0.5308 |
| sexMale:highest.degree.earnedMasters | -7640 | 7139 | -1.07 | 0.2846 |
| sexMale:highest.degree.earnedPhD | 4977 | 18958 | 0.26 | 0.7929 |
| sexMale:highest.degree.earnedProfessional Degree (DDS, JD, MD) | -8082 | 14703 | -0.55 | 0.5826 |
| sexMale:highest.degree.earnedNon-Interview | 594 | 7570 | 0.08 | 0.9375 |
| sexMale:highest.degree.earnedInvalid Skip | -7115 | 14762 | -0.48 | 0.6299 |
Reviewing the Income Gender Gap by Educational Attainment
Adding the interaction of sex * highest.degree.earned shows that while the income gap by sex persists at all degree levels, education can shrink it considerably. This holds at every degree level except the PhD, where the predicted income gap between a male PhD and a female PhD is 24,711 dollars. The PhD interaction term is not statistically significant (p = 0.7929).
By contrast, the income gap by sex narrows notably for master's degrees, at 12,094 dollars, and for professional degrees, at 11,652 dollars.
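The gaps quoted above come from adding the sexMale coefficient to the interaction coefficient for each degree. A minimal Python sketch of that calculation, using the Model 7 estimates from the table above (the variable names are our own):

```python
# Model 7 estimates copied from the coefficient table above.
sex_male = 19734
interaction = {
    "No Degree Earned": 0,   # reference level, no interaction term
    "GED": -8668,
    "High School Diploma": -2116,
    "Associate": -4614,
    "Bachelors": -3881,
    "Masters": -7640,
    "PhD": 4977,
    "Professional Degree (DDS, JD, MD)": -8082,
}

# Predicted male-minus-female income gap at each degree level.
gender_gap = {degree: sex_male + coef for degree, coef in interaction.items()}

# Degrees where the gap is wider than for respondents with no degree.
wider_than_reference = [d for d, gap in gender_gap.items() if gap > sex_male]

for degree, gap in gender_gap.items():
    print(f"{degree}: {gap}")
print("Wider than no-degree gap:", wider_than_reference)
```

Only the PhD level shows a wider predicted gap than the no-degree reference level, which is the exception noted in the text.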
The data also suggests that racial bias runs deep. Although including the sex * highest.degree.earned interaction reduces the Hispanic coefficient to just -49 dollars and removes its statistical significance, Black men and women can still expect to make 4,317 dollars less than their Non-Black/Non-Hispanic counterparts, and the result remains statistically significant (p = 0.0062).




Residual vs Fitted Plot
The residual vs. fitted plot shows that, for the lower fitted values, there is clustering around 0, and as the fitted values increase, deviation away from a residual of 0 increases. This is very similar to Model #2, but the deviation appears to be slightly more pronounced.
Normal QQ Plot
The normal QQ plot shows an almost identical match to Model #2, suggesting that education already has a great degree of influence on the model and is accounted for in Model #2. It is difficult to find any noticeable variation in either the upper or lower tail of the range.
Scale-Location Plot
As the fitted values increase, the variance still increases steadily very similarly to Model #2 and #3 but the trends present seem to have shifted slightly rightward, and show the most variation from Models 2 and 3 in the upper end of the income range.
Outliers & The Residuals vs Leverage Plot
The residuals vs leverage plot shows that many of the outliers that were clustered around 0 in Model #2 now have a greater spread, suggesting more leverage than in the initial model.
Conclusions from the Plot
Adding the sex * highest.degree.earned interaction term appears to make very slight changes toward the lower and upper ends of the range, with some outliers having more leverage in the model with the interaction term than without. However, the addition of the term does little to correct existing issues with the model and appears to be largely redundant.
Predicted Income Gender Gap between Different Educational Attainment Levels with and without the Interaction Variable
We see from looking at the predicted income gap regarding education from the main effects model that the values are actually quite close to what we saw in our model with the exception of the PhD education value, therefore this interaction term between sex * highest.degree.earned may be redundant.
| Highest degree earned | Predicted income gap (dollars) |
|---|---|
| No Degree Earned | 18156 |
| GED | 10820 |
| High School Diploma | 15988 |
| Associate | 11865 |
| Bachelors | 15887 |
| Masters | 13362 |
| PhD | 10750 |
| Professional Degree (DDS, JD, MD) | 17290 |
| Non-Interview | 11406 |
| Invalid Skip | 3601 |
The main effects model's estimates of the income gaps between men and women of different educational backgrounds are largely accurate, and they are similar to the findings from our model with the sex * highest.degree.earned interaction variable. Education tends to cut the income gap slightly. However, the main effects model underestimates the effect of a master's degree in reducing income inequality: the interaction model shows a gap of 12,094 dollars, while the main effects model estimates 13,362 dollars. It similarly underestimates the effect of a professional degree, where the interaction model shows a gap of 11,652 dollars as opposed to the 17,290 dollars estimated by the main effects model.
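To see where the two models disagree most, the gaps can be lined up degree by degree. The Python sketch below takes the main effects gaps from the table above and the interaction model gaps as the sexMale coefficient plus each interaction coefficient from the Model 7 table; the variable names are our own.

```python
# Gaps predicted by the main effects model (table above).
main_effects_gap = {
    "No Degree Earned": 18156, "GED": 10820, "High School Diploma": 15988,
    "Associate": 11865, "Bachelors": 15887, "Masters": 13362,
    "PhD": 10750, "Professional Degree": 17290,
}
# Gaps from the interaction model: sexMale (19734) plus each interaction term.
interaction_gap = {
    "No Degree Earned": 19734, "GED": 11066, "High School Diploma": 17618,
    "Associate": 15120, "Bachelors": 15853, "Masters": 12094,
    "PhD": 24711, "Professional Degree": 11652,
}

# Positive values: the main effects model predicts a larger gap.
discrepancy = {d: main_effects_gap[d] - interaction_gap[d] for d in main_effects_gap}
largest = max(discrepancy, key=lambda d: abs(discrepancy[d]))

for degree, diff in discrepancy.items():
    print(f"{degree}: {diff:+d}")
print("Largest disagreement:", largest)
```

The largest disagreement is at the PhD level, in line with the exception noted earlier, and should be read with the large standard error of the PhD interaction term in mind.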
Testing the Significance of Adding an Interaction Variable on the Sex * Highest Degree Earned Status Term
To test whether a sex * highest.degree.earned interaction variable is redundant we tested the new model against our original main effects model with an ANOVA test.
## Analysis of Variance Table
##
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months +
## condition.limiting.work + citizenship + urban.status.age.12 +
## residential.dad.highest.grade.completed + residential.mom.highest.grade.completed +
## race + num.children.under.6 + highest.degree.earned + marital.status +
## age.in.2017 + sex:highest.degree.earned
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2346 1595000619595
## 2 2337 1591771751744 9 3228867851 0.5267 0.8561
We see that the p-value of 0.8561254 is nowhere near the threshold required to reject the null hypothesis. Therefore, we fail to reject the null hypothesis that the income gap between men and women is the same across levels of educational attainment.
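The three interaction tests reported in this section can be summarized side by side by applying the same decision rule to each p-value. This is a small Python sketch; the p-values are copied from the ANOVA tables, and the helper name `decision` is our own.

```python
# p-values copied from the three interaction ANOVA tables in this section.
p_values = {
    "sex:race": 0.09663,
    "sex:marital.status": 0.0000001246,
    "sex:highest.degree.earned": 0.8561,
}

def decision(p, alpha):
    """Standard rule: reject the null hypothesis when p is below alpha."""
    return "reject" if p < alpha else "fail to reject"

decisions = {
    term: {alpha: decision(p, alpha) for alpha in (0.05, 0.10)}
    for term, p in p_values.items()
}

for term, by_alpha in decisions.items():
    print(term, by_alpha)
```

Only the sex:race result changes with the choice of threshold; the other two conclusions are the same at both levels.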